INTRODUCTION


Within the domain of Bayesian research on Probabilistic Information
Processing, only a few published studies have had the purpose of comparing
different direct estimation response procedures (Kaplan & Newman, 1960;
Phillips & Edwards, 1966; Fujii, 1967). Knowledge in this area becomes more
important as Bayesian techniques become more widely used in real world
applications. Some of those using this technology have had access to the
results of unpublished experiments conducted in university laboratories or
have done their own research on this topic; others have not.

This study analyzes the data from five separate research projects, only
one previously published, in order to examine what we know and some of what
we don't know about different direct estimation procedures for eliciting
judgments about uncertainty.

Since all of these studies were done in the Engineering Psychology
Laboratory, I have in each case been able to reanalyze the raw data. I am
grateful to my colleagues for help and access.
THE FRAMEWORK


Four orthogonal independent variables have been manipulated in direct
estimation response studies. These are the response mode in which S is asked
to express his uncertainty judgment, the cumulative or noncumulative nature
of the response, the scale used to record the response, and the additional
feedback given to S while he is deciding upon his assessment. The different
response modes that have been examined include likelihood ratios (LR), odds
(ODDS), and probabilities (PROB). Some experiments have systematically
varied whether or not Ss aggregate their uncertainty assessments over more
than one datum. The type of scale on which Ss record their estimates has
also been investigated. Two types of scale need to be distinguished:
predrawn logarithmically spaced scales and an everything-else category. The
variable Additional Feedback can itself be classified on the basis of three
orthogonal considerations. These are whether the additional feedback is in
graphic form or not, whether it is current system opinion based on just the
one datum being evaluated or on all the relevant data to date, and whether
it is in the form of ODDS or PROB.


These four variables form a four-dimensional framework, or taxonomy, that
classifies all the response conditions in all of the experiments that will
be discussed. This structure is presented in Table 1. The last dimension,
Additional Feedback, in order to be orthogonal with the other three
dimensions, specifically excludes those features of the response already
contained in a description of the other three dimensions. Consider two
examples. In the


TABLE 1

TAXONOMY

Category                  Definition

DIMENSION 1: RESPONSE MODE

ODDS                      Ss express their uncertainty in odds judgments

LR                        Ss express their uncertainty in likelihood ratio
                          judgments

PROB                      Ss express their uncertainty in probability
                          judgments

DIMENSION 2: AGGREGATION

CUM                       Ss' uncertainty judgments are aggregated over a
                          set of data

NONCUM                    Ss' uncertainty judgments are for a single datum
                          only

DIMENSION 3: SCALE

LOG                       Ss record their uncertainty judgments on a
                          predrawn logarithmically spaced scale

VERBAL                    Ss verbally state their uncertainty judgments, or
                          write them down in a blank on a form, or use a
                          nonlogarithmic scale

DIMENSION 4: ADDITIONAL FEEDBACK

NONE                      No additional feedback

G-C-P                     The additional feedback is a bar graph display of
(Graphic-Cumulative-      the probabilities of the hypotheses under
Probability)              consideration implied by the uncertainty judgments
                          both for the current datum and for all previous
                          data

G-N-P                     The additional feedback is a bar graph display of
(Graphic-Noncumulative-   the probabilities of the hypotheses under
Probability)              consideration implied by the uncertainty judgments
                          for the current datum only

V-C-P                     The additional feedback is a nongraphic
(Verbal-Cumulative-       representation of the probabilities of the
Probability)              hypotheses under consideration implied by the
                          uncertainty judgments both for the current datum
                          and for all previous data

V-C-O                     The additional feedback is a nongraphic
(Verbal-Cumulative-Odds)  representation of the odds implied by the
                          uncertainty judgments both for the current datum
                          and for all previous data
first, S marks his ODDS assessments on a logarithmically spaced scale and he
receives no additional feedback. In the second, S tells his LR assessments
to E, who first records them and then gives back to S the current odds based
on S's assessments for all data. These two situations would be classified in
the following way:


DIMENSION                 ODDS ASSESSMENT    LR ASSESSMENT

1: Response Mode          ODDS               LR

2: Aggregation            CUM                NONCUM

3: Scale                  LOG                VERBAL

4: Additional Feedback    NONE               V-C-O
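
The classification scheme can also be written down as a small data structure. The following is only an illustrative sketch, not part of the original studies: the class name `ResponseCondition` and its field names are hypothetical, while the category labels are those of Table 1.

```python
from dataclasses import dataclass

# Category labels taken from Table 1; the names of these constants are
# hypothetical and chosen only for this sketch.
RESPONSE_MODES = {"ODDS", "LR", "PROB"}
AGGREGATIONS = {"CUM", "NONCUM"}
SCALES = {"LOG", "VERBAL"}
FEEDBACKS = {"NONE", "G-C-P", "G-N-P", "V-C-P", "V-C-O"}


@dataclass(frozen=True)
class ResponseCondition:
    """One cell of the four-dimensional taxonomy."""
    response_mode: str
    aggregation: str
    scale: str
    feedback: str

    def __post_init__(self):
        # Reject any label that is not a category of the corresponding dimension.
        checks = [
            (self.response_mode, RESPONSE_MODES, "response mode"),
            (self.aggregation, AGGREGATIONS, "aggregation"),
            (self.scale, SCALES, "scale"),
            (self.feedback, FEEDBACKS, "additional feedback"),
        ]
        for value, allowed, name in checks:
            if value not in allowed:
                raise ValueError(f"unknown {name}: {value!r}")


# The two example situations classified in the text.
odds_example = ResponseCondition("ODDS", "CUM", "LOG", "NONE")
lr_example = ResponseCondition("LR", "NONCUM", "VERBAL", "V-C-O")
```

Because the four dimensions are treated as orthogonal, any combination of valid labels is accepted; the sketch checks only that each label belongs to its own dimension.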

The dimensions in this taxonomy, in addition to providing a framework for
the classification of different response situations, capture the essence of
a comparison of the Bayesian information processing approach with other
approaches. Thus the distinctive features of the Bayesian approach provide
the pattern which has guided the experiments on different direct estimation
procedures, as well as the pattern that has guided other experimentation in
this area.

The Bayesian approach to information processing has two distinctive
features. First, this approach assumes that information processing is a
'divide and conquer' process. In other words, it assumes that the total
assessment job should be broken down into smaller subtasks whereby the
impact of each datum is assessed separately and the individual assessments
are aggregated together mechanically. The proponents of the Bayesian
viewpoint believe that this division of the whole inference task into
smaller units not only makes the inference task easier, but also that it
makes the final inference more accurate because S can make better
assessments of the subunits. Thus the 'divide and conquer' label. The second
feature of a Bayesian information processing approach is that change of
opinion is additive on a logarithmic scale. This feature is in a sense the
heart of Bayes's Theorem. The implication of this feature is that revision
of opinion can be graphically represented by recording the series of
likelihood assessments of a set of data on a logarithmically spaced scale of
odds or probability. In fact, by giving S a running record of the impact of
his assessments on a logarithmically spaced odds or probability scale, one
displays Bayes's Theorem to him.
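
The additivity just described can be written out explicitly. As a sketch in notation that the original does not use (\Omega_0 for the prior odds, L_i for the likelihood ratio of the i-th datum, \Omega_n for the odds after n data), and assuming the data are conditionally independent given each hypothesis, the odds form of Bayes's Theorem and its logarithm are:

```latex
\Omega_n = \Omega_0 \prod_{i=1}^{n} L_i
\qquad\Longrightarrow\qquad
\log \Omega_n = \log \Omega_0 + \sum_{i=1}^{n} \log L_i
```

Each datum therefore moves opinion by a fixed distance, \log L_i, along a logarithmically spaced odds scale, regardless of where opinion stood before that datum arrived.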

If no differences result when Ss respond in LRs rather than odds, or when
Ss respond by using nonaggregated assessments rather than aggregated
assessments, or when Ss record their assessments on a log scale rather than
a nonlogarithmic device, or when Ss do not receive Bayes's Theorem
transformations of their assessments into some measure of the current
likelihood of the hypotheses rather than receiving this feedback, then the
formal introduction of Bayes's Theorem would serve no purpose for these Ss.
These Ss would already be responding as Bayes's Theorem would predict. If,
however, any of these experimental manipulations do result in assessment
differences, then the formal introduction of Bayes's Theorem into the system
would make a difference.


THE  EXPERIMENTS 


All five experiments that will be considered in this investigation had
the common purpose of studying the effects of different direct estimation
procedures for eliciting uncertainty judgments. Two of the studies were part
of the large scale simulated strategic war setting experiments conducted by
Ward Edwards and his colleagues at The University of Michigan to test the
Bayesian information processing ideas. The simulation environment for these
two studies was a simplified world ten years into the future. In this world
only six nations played significant political and military roles. These were
China, Japan, North America, Russia, United Arab Republic (a territory
reaching from the Atlantic to India dominated by a prophet who sparked a
Moslem revival), and the United Confederation of European States (a loose
economic and military confederation). A 27-page summary of the history of
the world gave Ss the background information they needed in order to become
information processors in this future world. The history was designed to
make different strategic war hypotheses, e.g., 'Russia and China are about
to attack North America,' as well as a 'Peace will continue to prevail'
hypothesis plausible. The list of hypotheses included four specific possible
wars, "some other major conflict is about to break out," and the peace
hypothesis. The Ss assumed the role of duty operators for the April 5th, 7PM
to 6PM shift. These duty operators were part of the information processing
system that served the Joint Chiefs of Staff. These processors were to
assume that they were located in the basement of the Pentagon.

Three sensors delivered data to this information processing system. These
were the Ballistic Missile Early Warning System (BMEWS), the intelligence
system, and a photo-reconnaissance satellite system. BMEWS, a large
computerized radar system with three sites, was a degraded version of the
present operational BMEWS system. The intelligence system was assumed to
consist of spies, military attaches in U.S. embassies abroad, readers of
foreign newspapers, and experts on foreign affairs. Each intelligence datum
was a report of an event usually accompanied by a brief qualitative
statement about the degree to which the event was surprising and what it
meant. The photo-reconnaissance satellite system was assumed to consist of
80 satellites. The information processors received satellite system data
that consisted of reports of particular events along with background
information of the kind that might be obtained by comparing recent
photographs with previous ones. A typical satellite system report was the
following:

"At 0630 this morning, two squadrons of conventional submarines sailed
from Vladivostok. They steamed in a southerly direction until they were
clear of the harbor, and then submerged. Evaluation: probably a routine
exercise although this is an unusually large force."

At each session Ss assessed a set of five independent LRs for each of the
60 data comprising a scenario. Altogether there were nine scenarios.

The results of one of these experiments were published in 1968 (Edwards,
Phillips, Hays & Goodman). This experiment, called PIPID, compared the
effects of two kinds of additional feedback, G-N-P and G-C-P, on the
performance of Ss making LR assessments on a log scale. Subjects recorded
their assessments on a set of five scales, each having a lever mechanism
that slid along the scale. A set of scales consisted of six different ranges
of logarithmically marked divisions. The S could turn a knob to select any
one of the six different ranges. The first range extended from 1:1 to 10:1,
the second extended from 10:1 to 100:1, and so on to 1,000,000:1. In front
of Ss in both groups was a cathode ray tube (CRT) placed above the levers.
The Ss in the group receiving the G-N-P additional feedback saw displayed on
the CRT a bar graph representation of the posterior probabilities of the
hypotheses based upon their current assessments only and equal prior
probabilities. This transformation of the current set of LR judgments
changed dynamically as an S either moved the levers, or reset a switch that
indicated under which hypothesis the datum was more likely, or turned the
knob that changed the scale range. The Ss in the group receiving the G-C-P
additional feedback saw on the CRT a display of the posterior probabilities
of the hypotheses based on their current LR assessments, all assessments for
the previous data in that scenario, and a prior probability distribution of
.10 for each of the war hypotheses and .50 for the peace hypothesis. This
transformation of the current set of judgments, of all previous judgments,
and of the unequal prior distribution also changed dynamically as S either
moved the levers, or reset the switch, or turned the knob changing the scale
range.
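
The two display transformations can be sketched in a few lines. This is a minimal illustration rather than the software actually used in PIPID. It assumes, which the text does not state outright, that each of the five LRs compares one war hypothesis with the peace hypothesis and is entered as a number greater than 1 when the datum favors the war hypothesis; the function and constant names are hypothetical.

```python
def posterior_probabilities(priors, lr_sets):
    """Return posterior probabilities for all hypotheses.

    priors  -- prior probabilities, with the reference (peace) hypothesis last
    lr_sets -- one list of LRs per datum; each list holds one LR per
               non-reference hypothesis, taken against the reference
    """
    weights = list(priors)
    for lrs in lr_sets:
        if len(lrs) != len(priors) - 1:
            raise ValueError("need one LR per non-reference hypothesis")
        # Multiply each hypothesis weight by its LR; the reference
        # hypothesis has an implicit LR of 1 against itself.
        for i, lr in enumerate(lrs):
            weights[i] *= lr
    total = sum(weights)
    return [w / total for w in weights]


EQUAL_PRIORS = [1 / 6] * 6               # used for the G-N-P display
UNEQUAL_PRIORS = [0.10] * 5 + [0.50]     # used for the G-C-P display


def gnp_display(current_lrs):
    """G-N-P: current datum only, equal priors."""
    return posterior_probabilities(EQUAL_PRIORS, [current_lrs])


def gcp_display(all_lrs_so_far):
    """G-C-P: every datum so far in the scenario, unequal priors."""
    return posterior_probabilities(UNEQUAL_PRIORS, all_lrs_so_far)
```

Under these assumptions a single LR of 5:1 for the first war hypothesis, with the other four LRs at 1:1, puts that hypothesis at .50 on the G-N-P display, whereas the same judgments on the G-C-P display start from the .10 and .50 priors and accumulate over the scenario.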

This experiment used a between-S design. Each of the 11 Ss, five in the
G-N-P group and six in the G-C-P group, was trained, was run individually,
completed one scenario per session, and had no more than one session per day.

According  to  the  framework  of  this  investigation,  the  two  groups  of  Ss 
in  this  particular  experiment  are  classified  in  the  following  way: 


DIMENSION                 Ss RECEIVING G-N-P    Ss RECEIVING G-C-P

1: Response Mode          LR                    LR

2: Aggregation            NONCUM                NONCUM

3: Scale                  LOG                   LOG

4: Additional Feedback    G-N-P                 G-C-P

In the other strategic war simulation experiment, called CORNOC, three
different response situations were compared in a between-S design. One group
of Ss, the DIS (display) group, had exactly the same response task as the
group of Ss in the PIPID experiment that received the G-N-P additional
feedback. Each S in this group assessed five LRs per datum for each of 60
data in the same nine scenarios. An S recorded his assessments on the same
five sets of logarithmically marked scales. The display on the CRT fed back
to S the posterior probabilities of the hypotheses under consideration based
upon the LR assessments for the current datum and upon an equal prior
distribution. Moreover, this display also changed dynamically as an S moved
any one of the levers, or reset any of the switches, or turned any of the
knobs to change a scale range. A second group of Ss, called the LEV (lever)
group, had the same task as the DIS group except that there was no display
on the CRT. Consequently, this group of Ss had no additional feedback. Each
S recorded his 5 LRs per datum on the logarithmically spaced scales and then
went on to assess the next datum in the scenario. The third group of Ss,
called the NOC (no computer) group, did not have any aids. Each S in this
group recorded the 5 LR assessments per datum on a single sheet of paper, in
appropriately labeled blank spaces, and then began a new sheet for the next
datum. The same nine scenarios used for the other groups of Ss were used
also for this group.

Twenty-one male University of Michigan students served as Ss. There were
l Ss in the DIS group, n Ss in the LEV group, and 8 Ss in the NOC group. Each
S was trained for approximately eight hours on the 'future world's' history
and political environment and eight hours on how to do LR assessment. Each
was run individually, completed one scenario per session, and had one
session per day. All 21 Ss repeated at least one scenario and 10 of these Ss
repeated four scenarios.

According to the taxonomy introduced, the three groups of Ss in this
experiment are classified as follows:


DIMENSION                 DIS Ss    LEV Ss    NOC Ss

1: Response Mode          LR        LR        LR

2: Aggregation            NONCUM    NONCUM    NONCUM

3: Scale                  LOG       LOG       VERBAL

4: Additional Feedback    G-N-P     NONE      NONE

The third experiment in this investigation, labeled W&E, was done by
Gloria Wheeler and Ward Edwards (in preparation). This experiment used a
within-S design to investigate the effects of four different response
situations. These situations were NONCUM LR, CUM LR, NONCUM ODDS, and CUM
ODDS. The Ss in this experiment were assigned to one of four groups. Each
group made the four different types of assessment in a different sequential
order.

The stimuli in this experiment were 7" sticks painted with a blue part
and a yellow part. The blue and yellow occurred in various proportions in
different sticks. Two populations were defined. The populations were piles
of sticks with the amount of blue and yellow paint as the random variable.
Each of the populations had a Gaussian (normal) distribution. One population
had a mean length of blue of l|.y, the other had a mean length of blue of
2.h". Each population had a standard deviation of i.g'y" of blue.

Let d' be defined as (m1 - m2)/s, where m1 is the mean of one population,
m2 is the mean of the second population, and s is the standard deviation of
both populations. For this experiment d' = !.*», which yields a veridical LR
at the mean of either population of i.i>u.
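
The link between d' and the veridical LR can be checked numerically. The sketch below assumes only what the text states, namely two normal populations with a common standard deviation; the function names and the numbers in the example are hypothetical and are not the values used in the W&E experiment.

```python
import math


def d_prime(m1, m2, s):
    """Separation of the two population means in standard deviation units."""
    return (m1 - m2) / s


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def likelihood_ratio(x, m1, m2, s):
    """Veridical LR of a stick with blue length x, population 1 over population 2."""
    return normal_pdf(x, m1, s) / normal_pdf(x, m2, s)


def lr_at_mean(dp):
    """Closed form for the LR of a datum falling exactly at a population mean.

    At x = m1 the ratio of the two densities reduces to exp(d'**2 / 2).
    """
    return math.exp(dp * dp / 2.0)


# Hypothetical illustration: means 4.0" and 3.0", standard deviation 1.0".
example_dp = d_prime(4.0, 3.0, 1.0)
example_lr = likelihood_ratio(4.0, 4.0, 3.0, 1.0)
```

With these illustrative numbers d' is 1.0, and the direct density ratio at the mean agrees with the closed form exp(d'^2 / 2). A stick falling exactly halfway between the two means has an LR of 1:1.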

Eight sequences of ten sticks each were determined by drawings from a
table of random normal deviates. Some of these sequences were then slightly
modified so that there was a fairly large range in veridical final posterior
odds and so that they looked random. From these eight 'random normal
deviate' sequences, 16 physically different sequences were constructed,
eight from the predominantly blue population and eight from the
predominantly yellow population. Every S saw every sequence twice, for a
total of 32 sequences. Each S for each different response situation saw
eight sequences. The 16 sequences were ordered randomly and that same order
was used throughout the whole experiment for every S, regardless of the
order in which he made the different response assessments.

The population characteristics were displayed to the Ss by two charts.
To prepare the charts, the cumulative normal distribution for each
population was divided into 100 equally likely parts. The mean lengths of
blue at 97 of the boundary points, randomly arranged, comprised the charts.

Responses were made in 10-page booklets, one response per page. On each
page was printed, "___:1 in favor of hypothesis ___." The Ss were briefly
instructed in each new response situation prior to beginning that type of
response. Thirty-six male University students, run individually or in pairs,
served as Ss for this experiment.

The four different response situations in this experiment are classified
in the following way:
DIMENSION

1: Response Mode          LR        LR        ODDS      ODDS

2: Aggregation            NONCUM    CUM       NONCUM    CUM

3: Scale                  VERBAL    VERBAL    VERBAL    VERBAL

4: Additional Feedback    NONE      NONE      NONE      NONE

The W&E experiment is the only experiment that I know about that studied
a CUM LR response. Part of the instruction to each S was a written
specification of the task on a "Cumulative LR Instruction Sheet." To
illustrate the nature of the inference being asked for, I quote the
following from this sheet:

"...We will proceed as follows. First I will show you a single stick, and
you will estimate a likelihood ratio for that stick. Then I will show you
another stick, and I want you to evaluate the likelihood ratio of both
sticks.

Formally, I am asking what is the likelihood ratio for two sticks. Forget
that you saw the first stick by itself, and forget the estimate you made.
Pretend that I drew the two sticks simultaneously from somewhere...Which
pile of sticks is more likely to produce those two sticks?..."

The fourth experiment in this investigation (Domas, Goodman & Peterson,
1972) examined the effects of six different response situations. This
experiment, called D,G&P, used a between-S design with each S making only
one type of response for all data. The six response conditions, classified
according to the introductory framework, are as follows:


DIMENSION

1: Response Mode          LR        LR        LR        ODDS      ODDS      ODDS

2: Aggregation            NONCUM    NONCUM    NONCUM    CUM       CUM       CUM

3: Scale                  VERBAL    VERBAL    LOG       VERBAL    LOG       LOG

4: Additional Feedback    NONE      V-C-O     NONE      NONE      NONE      V-C-P
The stimulus environment simulated a commercial shipping problem. Each S
was asked to assume the role of an analyst employed by a U.S. commercial
shipping firm. The S's task was to predict whether competitor ships were
destined for Port A or Port B. Port A was more distant than Port B and
involved a long open-water voyage.

Data concerning the competitor ships fell into one of four categories:
age of ship, gross tonnage, percent of capacity cargo load on board, and
fuel taken on at the port of departure. Data samples were generated by
drawing from pairs of Gaussian (normal) populations, each characterized by a
specified d' level. For example, data about the age of competitor ships were
drawn from two normal distributions with a d' of 0J10. The other d' levels,
0.80, 1.01, and 1.14, were associated with gross tonnage, percent of
capacity cargo, and fuel taken on, respectively.

Twenty-six sequences, each consisting of four data, were generated
randomly from the distributions associated with the four categories of data.
Every sequence contained a datum from each of the four categories. Thirteen
of the 26 sequences were drawn from distributions favoring Port A, the
remaining 13 from distributions favoring Port B.

The population characteristics were displayed to the Ss by four sets of
two charts. Each chart was a randomly arranged representative sample of the
appropriate population. On a chart, each datum was a 7" vertical line 1/8"
wide, partly black and partly white. The random variable had two
interpretations. It was a specific scale value, e.g., 60% capacity cargo, or
a length of black, e.g., 4.11".

The charts, eight in all, were arranged on a display board as follows:
all data about ships going to Port A were arranged from top to bottom on the
left side. Similarly, all data about ships destined for Port B were arranged
in a corresponding fashion, opposite Port A. There was a sliding scale
between each of the four pairs of charts. The E could vary the length of a
random sample from zero to 7" simply by moving the slider. To the S the
datum could be interpreted as a length, relative to the other lengths of
black on the charts, or as a number, representing the value of the data item
in question.

There were three different sets of response sheets. For Ss making LR
assessments on a verbal scale, each response sheet read as follows:

"It is ___ times more likely that this datum would occur if the ship were
going to Port ___ rather than to the other Port."


The Ss recorded their assessments for each datum on a separate page. Each
response sheet for Ss making odds assessments read as follows:

"___ is favored. The odds favoring this Port are ___."

Here again Ss made written assessments, one per page. The log scale response
sheet was a 17" by 22" sheet of paper with three logarithmically spaced
scales from 1:1 to 1000:1 on the top half and symmetric values from 1:1 to
1:1000 on the bottom half. This sheet was further divided by 10 vertical
lines with each line having the same logarithmic spacing. The Ss used one
sheet for each sequence of four data, allowing one vertical line per datum.

Forty male University of Michigan students served as Ss. Ten Ss were run
in each of two conditions, NONCUM LOG LR with no additional feedback and CUM
LOG ODDS with V-C-P additional feedback. Five Ss took part in each of the
remaining four conditions. Each S was run individually.


The last experiment in this investigation, called S,P&M, was performed
by Kurt Snapper, Cameron Peterson & Allan Murphy. The two Ss in this
experiment were University of Michigan faculty members who had been weather
forecasters. Their task was to assess the precipitation probability forecast
for each of 25 sequences of historical data. Each sequence contained one
sample of each of nine different sources of information, presented in chart
form. The sources selected were the most important items of information that
would be received in a forecast period. Moreover, these data appeared in the
order that they would be received at a U.S. Weather Bureau Station. These
sources included such items as isoprobability curves, upper and lower
atmospheric charts, and barometric charts. Because these sources of
information are conditionally dependent given the precipitation and
no-precipitation hypotheses, each S's task was to estimate conditionally
dependent probabilities. Their task, more strictly defined, was to assess
the probability of at least .01" of precipitation within the forecast
period.

This experiment used a within-S design. The two Ss assessed the
precipitation probability forecast in two different ways. However, each time
they were given the same 25 sequences of nine data.

In one situation S was asked for a cumulative conditional probability
judgment. In other words, he was asked for his present assessment of the
probability of precipitation given his previous probability assessment and
the new information just presented in the current datum. At the beginning of
a new sequence, S was to assume that the precipitation probability was . A •
"The precipitation probability forecast, based on Chart #102 is ___.

The revised probability forecast, based on Chart #'s 8 and 9 is ___.

The revised probability forecast, based on Chart #12 is ___.

(and finally)

The revised probability forecast, based on Chart #41 is ___."


In the other response situation S was asked for a noncumulative
conditional probability assessment. In this situation, for each assessment S
was to assume that all the previous data had led him to a probability
assessment of .50. His task was to revise this probability on the basis of
the new information presented to him in the current datum. In this condition
S filled out nine sheets of paper for each sequence. The first sheet read as
follows:

"The forecast, based only on the information in Chart #102 is ___."

The second sheet read in the following way:

"Assume that the previous precipitation probability forecast, based on all
previous information was .50.

The revised forecast, based on only the new information in Chart #'s 8 and
9 is ___."
The ninth sheet read as follows:

"Assume that the previous precipitation probability forecast, based on all
previous information was .50.

The revised forecast, based on only the new information in Chart #41 is
___."

The S received no additional feedback in either response condition. Each
S was run individually.

According to the framework of this investigation, the two response
situations in this experiment are classified in the following way:

DIMENSION 

1:  Response  Mode  PROB  PROB 

2:  Aggregation  CUM  NONCUM 

3:  Scale  VERBAL  VERBAL 

4:  Additional  Feedback  NONE  NONE 


RESULTS 


The following discussion uses two conventions; both are typically also
embodied in the experiments being reanalyzed and in the original reports of
them. One, only a convention, is that odds are expressed as numbers of the
form X:1, where X > 1. This can always be accomplished by appropriate choice
of which hypothesis enters into the numerator and which into the denominator
of the odds ratio. A similar convention applies to likelihood ratios, except
in contexts in which consistency with the preceding convention for odds
requires expressing the likelihood ratio as a number less than 1. The second
convention is that the conclusions are expressed as though the prior odds
before the first datum in each sequence were 1:1. In most of the experiments
here reanalyzed, that was true. Where it was not, appropriate attention was
paid to the question during the data analysis.

In general, the following data analyses were done on final log odds,
regardless of the response mode actually used by Ss. (An exception is noted in
the text.) "Final" means that the response is the last odds estimate or other
estimate of a cumulative quantity associated with a sequence of data that led
to successive revisions of that quantity, if a cumulative response mode was
used, or is the appropriate corresponding number calculated from a sequence
of noncumulative estimates, if a noncumulative response mode was used. So a
statement such as "noncumulative LR responses are larger than cumulative odds
responses" translates into "final log odds calculated from noncumulative LR
responses are more extreme, in the appropriate direction, than the logarithms
of final cumulative odds responses."
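
The conversion to final log odds described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, base-10 logarithms are assumed (the text does not state the base), and prior odds default to 1:1 in keeping with the second convention.

```python
import math

def final_log_odds_from_lrs(likelihood_ratios, prior_odds=1.0):
    """Aggregate noncumulative LR estimates into final log odds.

    Bayes's Theorem in odds form multiplies the prior odds by each
    likelihood ratio, so the logarithms simply add.
    """
    log_odds = math.log10(prior_odds)
    for lr in likelihood_ratios:
        log_odds += math.log10(lr)
    return log_odds

def final_log_odds_from_cumulative(odds_estimates):
    """For a cumulative response mode, 'final' is the last odds estimate."""
    return math.log10(odds_estimates[-1])

# Hypothetical responses to the same three-datum sequence.
noncumulative_lrs = [2.0, 5.0, 10.0]   # one LR estimate per datum
cumulative_odds = [2.0, 6.0, 20.0]     # revised odds after each datum

print(final_log_odds_from_lrs(noncumulative_lrs))
print(final_log_odds_from_cumulative(cumulative_odds))
```

In this invented example the noncumulative responses imply more extreme final log odds than the cumulative responses, which is the sense in which one set of responses is called "larger" than another.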


For this investigation one group of responses is considered significantly
larger than another group of responses when the following two conditions are
met: (1) the correlation coefficient between the two sets of assessments is
high enough so that it can be assumed that the two groups of responses are
linearly related (high enough is arbitrarily defined as greater than .900);
and (2) the 99% confidence interval for β, defined as the estimated slope of
the linear structural relation between these two random variables, does not
contain the point 1.000.

There is a basic difference between a linear structural relation analysis
and a linear regression analysis (see Isaac, 1970, and Kendall & Stuart, 1961).
The purpose of a linear regression analysis is to find the parameters, α and β,
of the line that best predicts the values of the dependent variable given the
values of the independent variable. It is assumed that the dependent variable
is a random variable and that the independent variable is fixed and measured
without error. The purpose of a linear structural relation analysis, however,
is to determine the interrelationship between two variables when both are
random and when either or both of them are measured with error. The
interrelationship is measured by the slope, β, of the straight line relating
one dependent variable to the other.

The model for a linear structural relation is given by the following formula:

$$y = \alpha + \beta (x - \epsilon_x) + \epsilon_y , \qquad (1)$$

where $\epsilon_x$ is the random error in measuring x and $\epsilon_y$ is the
random error in measuring y.
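When $\lambda$ = (variance of $\epsilon_y$)/(variance of $\epsilon_x$) is known,
the formula for measuring β, as given by Isaac (1970), is as follows:

$$\beta = \frac{(\operatorname{var} y - \lambda \operatorname{var} x) + \left[ (\operatorname{var} y - \lambda \operatorname{var} x)^2 + 4 \lambda \operatorname{Cov}^2(x,y) \right]^{1/2}}{2 \operatorname{Cov}(x,y)} \qquad (2)$$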


The confidence limits for β are given by the following formula taken from
Kendall & Stuart (1961):

$$\theta \pm \frac{1}{2} \arcsin \left\{ 2t \left[ \frac{\operatorname{var} x \cdot \operatorname{var} y - \operatorname{Cov}^2(x,y)}{(n-2) \left[ (\operatorname{var} x - \operatorname{var} y)^2 + 4 \operatorname{Cov}^2(x,y) \right]} \right]^{1/2} \right\} \qquad (3)$$

where θ = arc tan β and t is the appropriate "Student's" deviate for (n-2)
degrees of freedom and the confidence level chosen.
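
As a rough illustration of how formulas (2) and (3) might be evaluated, the following sketch computes the structural slope for a given ratio of error variances and then the confidence limits. The function names and the example data are hypothetical, the Student's deviate t is assumed to be looked up by the caller, and formula (3) is applied exactly as printed. The choice of n - 1 denominators for the sample moments is an assumption, but it cancels out of both formulas.

```python
import math

def moments(x, y):
    """Return n, var(x), var(y) and Cov(x, y) with n - 1 denominators."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return n, var_x, var_y, cov

def structural_slope(x, y, lam):
    """Formula (2): slope of the structural relation for a known lambda."""
    _, var_x, var_y, cov = moments(x, y)
    d = var_y - lam * var_x
    return (d + math.sqrt(d * d + 4.0 * lam * cov * cov)) / (2.0 * cov)

def slope_confidence_limits(x, y, beta, t):
    """Formula (3): limits for theta = arctan(beta), mapped back to slopes.

    Returns None when the arcsine argument exceeds 1, in which case the
    formula gives no limits. Very wide intervals that cross a vertical
    line are not handled in this sketch.
    """
    n, var_x, var_y, cov = moments(x, y)
    numerator = max(var_x * var_y - cov * cov, 0.0)
    denominator = (n - 2) * ((var_x - var_y) ** 2 + 4.0 * cov * cov)
    argument = 2.0 * t * math.sqrt(numerator / denominator)
    if argument > 1.0:
        return None
    half_width = 0.5 * math.asin(argument)
    theta = math.atan(beta)
    return math.tan(theta - half_width), math.tan(theta + half_width)

# Hypothetical average final log odds for two groups of Ss.
group_x = [0.2, 0.9, 1.4, 2.1, 2.8, 3.3]
group_y = [0.3, 1.2, 1.9, 2.6, 3.7, 4.1]
beta = structural_slope(group_x, group_y, lam=1.0)
print(beta, slope_confidence_limits(group_x, group_y, beta, t=4.604))
```

Setting lam to 0 reduces formula (2) to var y / Cov(x,y), which is one of the two extreme cases used when no repeat data are available.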

In order to calculate β and its confidence interval, λ, the ratio
of the error variances, must be determined. The error variances were estimated
differently in the different experiments. In the first experiment, PIPID, no
scenarios were repeated. Therefore, the slope β for the two groups in this
study was estimated for two values of λ, λ = 1 and λ = 0. These values of λ
are extreme values given the λ values estimated in the other experiments where
reliability information was available. (Here and in the Snapper, Peterson &
Murphy experiment discussed below, the choice of which dimension to call x and
which to call y was made to keep λ < 1.)

In the second experiment, CORNOC, the error variances were estimated from
repeated data values. All of the 21 Ss in CORNOC repeated at least one of the
nine scenarios and 19 of these Ss repeated four scenarios. Since there were
five final odds per scenario and four scenarios with repeat data, there were
20 points for which I calculated both the first round average group final log
odds assessments and the second round average group final log odds assessments.
The error variances were then estimated to be the variances of the difference
between the average first round and second round final log odds assessments.
The necessary λ values were then calculated by taking the appropriate ratios
of these error variance estimates.
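
The estimation of error variances from repeat data can be sketched as follows. This is a minimal illustration, not the original computation: the function names and data are hypothetical, the sample variance with an n - 1 denominator is an assumption, and the ratio is taken as the error variance of the group plotted on y over that of the group plotted on x.

```python
from statistics import variance

def error_variance(first_round, second_round):
    """Variance of the differences between paired round averages.

    Each list holds one average group final log odds per repeated point.
    """
    differences = [a - b for a, b in zip(first_round, second_round)]
    return variance(differences)  # n - 1 denominator (assumption)

def error_variance_ratio(error_var_y, error_var_x):
    """Ratio of error variances, y group over x group (assumed direction)."""
    return error_var_y / error_var_x

# Hypothetical first and second round averages for two groups of Ss.
y_first, y_second = [0.50, 1.10, 1.60, 2.20], [0.55, 1.00, 1.70, 2.10]
x_first, x_second = [0.40, 0.90, 1.50, 1.90], [0.60, 0.70, 1.70, 1.60]

lam = error_variance_ratio(
    error_variance(y_first, y_second),
    error_variance(x_first, x_second),
)
print(lam)
```

If the ratio came out greater than one, the roles of x and y would be exchanged, in line with the choice described above for keeping the ratio below one.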

All Ss in the Wheeler & Edwards experiment repeated eight of the 16 scenarios.
Each S repeated two scenarios in each of the four different response
conditions.  The  experimental  design  is  such  that  for  each  scenario  under  each 
response  condition,  there  are  from  7  to  .10  persons  with  repeat  data.  The  es¬ 
timated  error  variance  for  each  response  condition  is  the  variance  of  the  dif¬ 
ference  between  the  first  and  second  round  average  final  log  odds  measurements 
for  the  eight  repeated  scenarios 

In the Domas, Goodman & Peterson experiment, no scenarios were repeated.
However, of the 104 stimuli, 13 pairs (26 stimuli) were such that the true LR
of one member of the pair was within .01 of the true value of the other member.
These 26 data or 13 pairs were used to estimate error variances for each group
of Ss. The first occurrence of one of the pair was considered the first round
estimate, while the occurrence of the second member of this pair was considered
the second round estimate. Each of the first and second round assessments was
then averaged across the Ss within a group. The error variance for each of the
six groups of Ss was estimated to be the variance of the difference between
what is considered the first and second round average log LR assessments for
the 13 pairs of data.

There were no repeat data in the Snapper, Peterson & Murphy experiment.
Consequently, β was estimated twice, using the two λ values of one and zero.
The summary of the linear structural relation analysis statistics for the
different groups of Ss in the different experiments, when only a single
dimension was varied in the response conditions between the groups being
compared, is contained in Table 2. The dependent variable for all comparisons
is the average final log odds for each group. The final odds in each case was
either estimated directly or calculated by means of Bayes's Theorem.

On the basis of the evaluation criteria initially established, i.e.,
r > .900 and the 99% confidence interval for β not containing 1.000, several
conclusions can be drawn from this analysis. First, odds assessments are
sometimes significantly larger than LR assessments. Verbal, cumulative odds
estimates were significantly larger than verbal, cumulative LR estimates, when
there was no additional feedback to either group. However, odds were not
significantly larger than LRs when both were verbal, noncumulative estimates
with no additional feedback.

A second conclusion is that aggregation makes a difference for verbal
responses with no additional feedback. Nonaggregated judgments, whether LRs
or odds, resulted in significantly larger final posterior odds than did
aggregated judgments. There is evidence that this same conclusion may hold for
conditional probabilities. In the Snapper, Peterson & Murphy experiment, the
data for each forecaster were analyzed separately. The β values were estimated
for cumulative probability plotted on the y-axis and noncumulative probability
plotted on the x-axis. For one forecaster β is .292 when λ = 1, and .bj39 when
λ = 0. The 95% confidence interval for the larger value is (J+bO, -89M-. The
slope for the other forecaster is .218 when λ = 1, and 09b when λ = 0. The
99% confidence interval around .99b is (o02, .901). These values were not
put into Table 2 because the correlation coefficients, . >29 and .727, for these
two sets of assessments were considerably lower than the correlations for the
other functions. However, this correlational analysis was done in probabilities
and not log odds, thereby constraining the range of the scales. Moreover,
each of these two analyses was based on individual data, not averaged data.
Since the upper bound of the confidence interval for β is less than one
for both forecasters, the data strongly suggest that aggregation reduces the
size of probability assessments. This issue needs to be tested in controlled
experiments with many Ss, using data that would lead to final cumulative
probabilities that span the range from .5 to at least .999. Covering this range
densely enough to permit separate analyses of more and less extreme regions is
important because many researchers believe that Ss don't know how to use the
extreme regions of the probability scale correctly. Therefore, experimental
tests that are based solely on responses appropriate to the ends of the scale
may be susceptible to scale errors in addition to whatever assessment errors
may exist. To guard against very nonhomogeneous 'random response error' in
different parts of the probability scale, stimuli should be selected from all
parts, and data analyses performed on subsets of the data.

A third conclusion is that the scale makes a significant difference when
there is no additional feedback. Log scale responses were significantly larger
than verbal scale responses for both NONCUM LRs and CUM ODDS, when there was
no additional feedback.

A fourth conclusion from Table 2 is that additional feedback of the
posterior probabilities of the hypotheses under consideration makes a
significant difference regardless of whether that feedback is just for the
current datum or for all previous data. There is also some evidence that
cumulative odds feedback may make a significant difference for NONCUM LR
responses on a verbal scale, even though the 99% confidence interval for β
contains 1.000. In the Domas, Goodman & Peterson experiment, in addition to
doing the linear structural relation analyses on the 26 average final log odds
for each group, I also did these same analyses using the 104 average log LRs
as points. The conclusions to be drawn in Table 2 from the average log LR
analyses were the same as from the average final log odds analyses except for
the comparison of NONCUM LR assessment on a verbal scale with and without
V-C-0 additional feedback. In this case, the log LR analysis resulted in a
significant difference between the two groups. The group without the
additional feedback made larger estimates than the group with the V-C-0
additional feedback. Here is another
hypothesis  that  can  be  put  to  experimental  test.  But  this  hypothesis  warrants 
further  testing  not  only  because  the  different  dependent  variables  resulted  in 
different  levels  of  significance,  but  also  because  the  slopes  were  so  very 
close  to  one,  .9^  for  the  final  log  odds  analysis  and  .907  for  the  log  LR 
analysis. 

The previous conclusions were based on comparisons between group estimates
without considering optimality. For all the data in both the W&E and D,G&P
experiments, the true value of each LR was known. By applying Bayes's Theorem,
the final odds was calculated for each sequence of data in both experiments.
Linear regression analyses were performed whereby the average group final log
odds for all sequences of data were compared with the Bayesian final log odds.
Given high correlation coefficients and intercepts close to zero, the closer
a particular regression slope is to one, the more optimal are the group's
estimates. The summary of these linear regression analyses is found in Table
3. However, since no statistical tests were performed on the differences
between these slopes, we cannot conclude that any one slope is significantly
more accurate than any other slope.
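
The regression of group final log odds on Bayesian final log odds can be illustrated with ordinary least squares. The sketch below uses hypothetical names and invented data; it only shows how a slope below one, together with a high correlation and an intercept near zero, would indicate conservative estimates.

```python
import math

def least_squares(bayesian, group):
    """Regress group final log odds (dependent) on Bayesian final log odds.

    Returns (intercept, slope, correlation coefficient).
    """
    n = len(bayesian)
    mean_x = sum(bayesian) / n
    mean_y = sum(group) / n
    sxx = sum((a - mean_x) ** 2 for a in bayesian)
    syy = sum((b - mean_y) ** 2 for b in group)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(bayesian, group))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

# Hypothetical data: the group's log odds are about half the Bayesian values.
bayesian_log_odds = [0.0, 1.0, 2.0, 3.0]
group_log_odds = [0.0, 0.5, 1.0, 1.5]
print(least_squares(bayesian_log_odds, group_log_odds))
```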

In the W&E experiment all four groups used verbal scales and had no
additional feedback. Under these conditions, the cumulative odds assessments
were significantly larger than the cumulative LR assessments, and the slope for the
cumulative  odds  versus  Bayesian  odds  regression  was  o58  as  contrasted  with  a 
slope of .283 for the cumulative LR group. Therefore, we don't know whether
there is a tendency for cumulative odds to be larger or more accurate than
cumulative LRs. The hypothesis that needs to be tested is that cumulative
odds assessments are more accurate than cumulative LR assessments. A test of
this kind requires an experimental situation where truth is known and where the
veridical cumulative odds judgments run the range of values from 1:1 to very
large numbers of at least 1000:1. Furthermore, separate data analyses should
be done over different intervals of the range.

In this same experiment the noncumulative odds judgments were larger, but
not significantly larger, than the noncumulative LR judgments. However, the
slope for the noncumulative odds group in the Bayesian regression was farther
from  one,  1.166,  than  the  slope,  1.0^1,  for  the  noncumulative  LR  group.  Con¬ 
sequently,  for  the  nonaggregated  condition,  the  data  suggest  that  while  odds 
assessments  may  be  larger  than  LR  assessments,  they  may  also  be  less  accurate. 
This hypothesis needs to be tested using situations where truth is known, where
the veridical noncumulative likelihood judgments run over a large range of
values and where separate data analyses are done over different portions of
the range.
LINEAR REGRESSION STATISTICS FOR GROUP FINAL LOG ODDS,
AS DEPENDENT VARIABLE,
COMPARED WITH BAYESIAN FINAL LOG ODDS
In the D,G&P experiment accuracy and size comparisons can be made for the
scale dimension. Assessments made on a log scale were significantly larger
than assessments made on a verbal scale for both the noncumulative LR and
cumulative odds response groups that received no additional feedback. Moreover,
these log scale assessments resulted in slopes further away from 1.000 in the
Bayesian regression. These results suggest that assessments made on a log
scale may be larger than assessments made on a verbal scale, irrespective of
accuracy, when there is no additional feedback. This hypothesis needs further
testing in situations where truth is known, where the true values extend over
a wide range, and where separate data analyses are done over different parts
of this range.

Data from the D,G&P experiment also suggest that while cumulative odds
assessments on a log scale with no additional feedback are significantly larger
than this same kind of assessment with V-C-P additional feedback, they may also
be less accurate. Thus cumulative posterior probability additional feedback
may increase accuracy, at least for an aggregated odds response made on a log
scale. This hypothesis needs to be tested.

While linear structural relation analyses comparing just those groups
where a single dimension in the response condition is varied at any one time
identify those dimensions that affect the size of assessments, such analyses
do not give any information about the relative strengths of the effects of
response conditions on response magnitudes. To determine an exact ordering of
the different response conditions based either on the size or accuracy of the
assessments would require a very large factorial experiment, which has not
been done. However, it is possible to get some information about an ordering
of the different response conditions based on size of assessments by doing
linear structural relation analyses on groups in which two or more response
dimensions vary simultaneously. The statistics summarizing certain of these
linear structural relation analyses are presented in Table 4.

An order relation, >, is defined as follows over two sets of assessments,
A and B, where the A values are plotted on the y-axis and the B values on the
x-axis. First, A > B, if β in the structural relation analysis is greater
than one and if the 95% confidence interval for β does not span 1.000. Second,
A > B, if several different structural analyses compare A and B, A > B by
assumption 1 for at least one comparison, and not B > A for any of the others.
If not A > B and not B > A, then A ? B.
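
The order relation can be written out as a small decision rule. The sketch below is an illustration only: the function names and the numerical values are hypothetical, and treating a slope below one whose interval excludes 1.000 as evidence for B > A (without replotting B on the y-axis) is an assumption.

```python
def single_comparison(beta, ci_low, ci_high):
    """Classify one structural analysis with A on the y-axis, B on the x-axis.

    Returns ">" for A > B, "<" for B > A, and "?" when the confidence
    interval for the slope spans 1.000.
    """
    if ci_low <= 1.0 <= ci_high:
        return "?"
    return ">" if beta > 1.0 else "<"

def combined_comparison(results):
    """Combine several analyses of the same pair (the second rule).

    A > B needs at least one ">" and no "<"; the mirror case gives B > A;
    anything else is indeterminate.
    """
    if ">" in results and "<" not in results:
        return ">"
    if "<" in results and ">" not in results:
        return "<"
    return "?"

# Hypothetical slopes and confidence limits for two analyses of one pair.
first = single_comparison(1.30, 1.10, 1.55)    # interval lies above 1.000
second = single_comparison(1.02, 0.91, 1.20)   # interval spans 1.000
print(first, second, combined_comparison([first, second]))
```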

The  simultaneous  varying  of  response  mode  and  aggregation  in  the  W&E  ex¬ 
periment  provides  much  information.  From  the  single  dimensional  analyses  of 
these  data  we  know  that  ODDS  >  LRs  and  that  NONCUM  responses  >  CUM  responses. 
These  inequalities  suggest  one  of  the  two  following  orderings: 

NONCUM  ODDS  >  NONCUM  LR  >  CUM  ODDS  >  CUM  LR 

or 

NONCUM  ODDS  >  CUM  ODDS  >  NONCUM  LR  >  CUM  LR  . 

To determine the exact ordering of these four groups, the following six
comparisons are necessary: NONCUM ODDS & NONCUM LR, NONCUM ODDS & CUM ODDS,
NONCUM ODDS & CUM LR, NONCUM LR & CUM ODDS, NONCUM LR & CUM LR, and CUM ODDS &
CUM LR. The single dimensional analyses show that NONCUM ODDS ? NONCUM LR,
NONCUM ODDS > CUM ODDS, NONCUM LR > CUM LR and CUM ODDS > CUM LR. Varying
these two dimensions simultaneously results in NONCUM LR > CUM ODDS and
NONCUM ODDS > CUM LR. Therefore, the ordering that emerges from the W&E
experiment is NONCUM ODDS ? NONCUM LR; NONCUM ODDS, NONCUM LR > CUM ODDS > CUM LR.
This ordering implies that the aggregation dimension is more important than
the response mode dimension in determining the size of assessments.

In  the  D,G&P  experiment  these  same  two  dimensions  were  simultaneously 
varied.  When  the  scale  and  additional  feedback  conditions  were  the  same  as 
in the W&E experiment, i.e., verbal scale and no additional feedback, the same
interactive ordering emerged, NONCUM LR > CUM ODDS. When the log scale was
used  as  a  recording  mechanism  instead  of  the  verbal  scale,  the  ordering  was 
NONCUM  LR  ?  CUM  ODDS  with  final  log  odds  as  the  dependent  variable  in  the 
structural  analysis  and  NONCUM  LR  >  CUM  ODDS  with  log  LRs  as  the  dependent 
variable.  Thus  both  experiments  confirm  the  finding  that  aggregation  is  more 
important  than  response  mode  in  determining  the  size  of  assessments. 

The  scale  and  additional  feedback  dimensions  were  simultaneously  varied 
in  both  the  CORNOC  and  D,G&P  experiments.  The  single  dimensional  structural 
analyses  of  the  CORNOC  experiment  yielded  these  inequalities:  LOG  >  VERBAL 
and  NONE  >  G-N-P  for  NONCUM  LR  responses.  These  two  inequalities  imply  one 
of  the  two  following  orderings : 

LOG, NONE  >  LOG, G-N-P  >  VERBAL, NONE  >  VERBAL, G-N-P

or 

LOG, NONE  >  VERBAL, NONE  >  LOG, G-N-P  >  VERBAL, G-N-P  . 

To completely determine the correct ordering six comparisons are necessary.
The single dimensional analyses yielded LOG, NONE > VERBAL, NONE and LOG, NONE >
LOG, G-N-P. The analysis in Table 4 showed that VERBAL, NONE > LOG, G-N-P.
Without all six comparisons the ordering cannot be completely determined.
However, the ordering suggested by the data is as follows:

LOG, NONE  >  VERBAL, NONE  >  LOG, G-N-P  >  VERBAL, G-N-P  .

This  ordering  implies  that  the  additional  feedback  of  the  posterior  probabili¬ 
ties  over  the  hypotheses,  even  for  a  single  datum  assessment,  may  be  more  im¬ 
portant  than  the  log  scale  in  determining  the  size  of  noncumulative  LR  esti¬ 
mates.  This  hypothesis  needs  testing. 

The  single  dimensional  analyses  of  the  D,G&P  experiment  resulted  in  the 
following  inequalities:  LOG, NONE  >  VERBAL, NONE  for  both  CUM  ODDS  and  NONCUM 
LR;  LOG, NONE  >  LOG,V-C-P  for  CUM  ODDS;  and  VERBAL, NONE  >  VERBAL, V-C-0  for 
NONCUM  LR.  The  analyses  in  which  two  response  dimensions  were  varied  simul¬ 
taneously  yielded  LOG, NONE  >  VERBAL, V-C-0  for  NONCUM  LRs  and  ambiguous  results 
for the LOG, V-C-P and VERBAL, NONE comparison for CUM ODDS. The structural
relation analysis of average final log odds yielded a β of 1.020 with a 99%
confidence interval of (.910, 1.198) when LOG, V-C-P is plotted on the y-axis and
VERBAL, NONE on the x-axis. The structural relation analysis of average log
LR yielded a β of .882 with a 99% confidence interval of ( . ^90, .989) for this
same comparison. These findings for the D,G&P experiment show that the LOG,
NONE  condition  yielded  the  largest  estimates  of  all  and  that  the  log  scale  is 
more  important  than  V-C-0  additional  feedback  in  determining  the  size  of 
NONCUM  LR  responses. 

The CORNOC and D,G&P experiments together imply that when considering
just the log scale and additional feedback dimensions, one is not more
important than the other. The order of these two dimensions depends upon the type
of feedback given. The results were that odds additional feedback is not more
important than the log scale in determining the size of LR estimates, but that
probability feedback may be more important than the log scale in determining
the size of CUM ODDS estimates. This latter assertion is a hypothesis in need
of a test.

Simultaneously varying all four dimensions in the D,G&P experiment
yielded some interesting information. The ODDS, CUM, LOG, NONE response
condition resulted in larger estimates than the LR, NONCUM, VERBAL, V-C-0 response
condition. The LR, NONCUM, VERBAL, V-C-0 response condition resulted in larger
estimates than the ODDS, CUM, LOG, V-C-P response condition. If we are willing
to assume transitivity, then these two results together, ODDS, CUM, LOG, NONE >
LR, NONCUM, VERBAL, V-C-0 > ODDS, CUM, LOG, V-C-P, imply that probability feedback
is more important than the combined effects of the response mode, aggregation,
and scale dimensions.

Reliability  information  was  used  in  the  linear  structural  relation  anal¬ 
yses  to  estimate  error  variances.  However,  the  reliability  of  LR  and  ODDS  es¬ 
timates  is  a  topic  worthy  of  independent  consideration.  Some  researchers  have 
collected  repeat  data  on  Ss  who  performed  in  Bayesian  information  processing 
experiments. No one, to my knowledge, has reported any of it, however. The
data in Table 5 fill this void somewhat.

I analyzed the repeat data for Ss in the CORNOC, W&E, and D,G&P
experiments. From each experiment the particular data points that are used for
these analyses are the same data points that were used to determine the error
variances for the linear structural analyses. The different statistics
incorporated into Table 5 are the following: (1) the difference between the first
round and second round assessments averaged over the number of points in the
function; (2) the percentage of the absolute value of this average difference
divided by the average range of assessments in the first round; (3) the
percentage of the absolute value of the average difference divided by the average
range of assessments in the second round; (4) the variance of the difference
between the first and second round assessments averaged over the number of
points in each function; and (5) the correlation coefficient of the function
relating the first round assessments to the second round assessments.
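
One way to read the five reliability statistics is sketched below. This is an interpretation, not the original computation: the function name and data are hypothetical, the "range of assessments" is taken as the maximum minus the minimum of the averaged assessments in a round, and the variance uses an n - 1 denominator.

```python
import math
from statistics import mean, variance

def reliability_summary(first_round, second_round):
    """Five reliability statistics for paired first and second round averages."""
    differences = [a - b for a, b in zip(first_round, second_round)]
    mean_difference = mean(differences)
    range_first = max(first_round) - min(first_round)
    range_second = max(second_round) - min(second_round)
    mean_first, mean_second = mean(first_round), mean(second_round)
    cross = sum((a - mean_first) * (b - mean_second)
                for a, b in zip(first_round, second_round))
    ss_first = sum((a - mean_first) ** 2 for a in first_round)
    ss_second = sum((b - mean_second) ** 2 for b in second_round)
    return {
        "mean_difference": mean_difference,                                   # (1)
        "percent_of_first_range": 100.0 * abs(mean_difference) / range_first,   # (2)
        "percent_of_second_range": 100.0 * abs(mean_difference) / range_second, # (3)
        "variance_of_difference": variance(differences),                      # (4)
        "correlation": cross / math.sqrt(ss_first * ss_second),               # (5)
    }

# Hypothetical average final log odds on the two rounds.
first = [0.0, 1.0, 2.0, 3.0]
second = [0.2, 1.2, 2.2, 3.2]
print(reliability_summary(first, second))
```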

By any criterion, most of the groups of Ss in all three experiments were
very reliable.
DISCUSSION AND CONCLUSIONS


This study has attempted to organize the information available on different
direct estimation procedures investigated in experiments on Bayesian
information processing. A taxonomy classified different response situations on
the basis of four independent dimensions: response mode, aggregation, scale,
and additional feedback.

Linear structural analyses were performed in which the different response
situations were compared. Two criteria were established, a correlation
coefficient of greater than .900 and a 99% confidence interval about the slope not
containing the point 1.000. When these criteria were met, the set of estimates
made under one response condition was considered significantly larger than the
set of estimates made in the other response condition.

On the basis of these criteria and the regression analyses performed when
truth was known, the following results have been demonstrated:

1. The response mode sometimes makes a difference. CUM ODDS assessments
were significantly larger than CUM LR assessments when both sets of
estimates were recorded on verbal scales and there was no additional
feedback.

2. Aggregation makes a difference. Nonaggregated LRs or ODDS were
significantly larger than the judgments made in the corresponding aggregated
conditions when the responses were made on verbal scales and
there was no additional feedback. There are data to suggest the
hypothesis that this finding may hold true for PROBs.


3. The scale used makes a difference. Log scale recording devices
resulted in significantly larger assessments than verbal scale recording
devices for both the NONCUM LR and CUM ODDS groups when there was
no additional feedback.

4. Additional feedback of the posterior probabilities of the hypotheses
does make a significant difference in the size of NONCUM LR and CUM
ODDS judgments. There is some evidence to suggest the hypothesis
that V-C-0 additional feedback may make a difference for NONCUM LRs
on a verbal scale.

5.  The  use  of  log  recording  scales  results  in  significantly  larger  as¬ 
sessments  than  the  use  of  other  methods  of  recording  estimates.  It 
may  be  that  this  result  holds  regardless  of  accuracy.  This  hypothe¬ 
sis  needs  to  be  tested. 

6.  Cumulative  posterior  probability  additional  feedback  may  increase  the 
accuracy  for  CUM  ODDS  assessments  made  on  a  log  scale.  This  is  one 
more  hypothesis  to  be  investigated. 

7.  Aggregation  is  more  important  than  response  mode  in  determining  the 
size  of  assessments. 

8. The log scale is more important than V-C-0 additional feedback in
determining the size of NONCUM LR responses.

9. The hypothesis that cumulative probability additional feedback may be
more important than the combined effects of the response mode,
aggregation and scale dimensions is the most exciting hypothesis that these
data suggest for me. But this issue needs very careful examination.


The findings of this investigation are supported by previous research.
Kaplan and Newman (1960) found that the computer-aggregated posterior
probabilities for Ss making nonaggregated P(D|H) judgments were significantly larger
than posterior probabilities directly assessed by Ss making aggregated P(H|D)
judgments. The data from one of the three experiments reported in Phillips &
Edwards (1966) suggest that log scales increase the size of probability
judgments. Tsuneko Fujii, who ran parts of these experiments, reanalyzed the data
and found that groups that estimated on log scales, odds or probabilities,
made larger and more accurate estimates than groups using non-log scales.
These results are reported in Fujii (1967).

Since  the  experimental  manipulations  in  the  studies  reported  in  this  in¬ 
vestigation  resulted  in  significant  differences  in  the  size  of  assessments, 
all  Ss  who  served  as  information  processors  did  not  perform  as  Bayes's  Theorem 
would  predict.  Thus  the  particular  dimensions  suggested  by  the  Bayesian  ap¬ 
proach  and  incorporated  into  the  taxonomy  do  make  a  difference. 


The finding that both LR and ODDS assessments are extremely reliable was
most  encouraging.  However,  these  analyses  were  performed  on  averaged  data, 
not  individual  data. 

The results of this investigation suggest some advice for persons applying
the Bayesian approach in real world applications. Don't have Ss give their
uncertainty judgments on LOG scales if no feedback is given to them about the
implications of their assessments, at least until there is further testing of
the hypothesis that LOG scales result in larger assessments regardless of
accuracy. Be cautious about feeding back the posterior probabilities of the
hypotheses under consideration. This feedback may be the most potent variable
studied. Use either the LR or ODDS response mode, whichever is more
convenient, when you ask for noncumulative responses. Verbally aggregated responses
with no additional feedback result in very conservative assessments, and should
consequently be avoided.